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ESTIMATES  OP  THE  ROUNDOFF  ERROR  IN  THE  SOLUTION 
OF  A  SYSTEM  OF  CONDITIONAL  EQUATIONS 

by 

» 

V.  I.  Gordonova 

Zh.  Vychislitel.  Matem.  i  Matem.  Fiziki,  Vcl.  9>  No.  4 
July-August  1969 ,  PP-  775-83 

In  the  present  work  we  will  examine  estimates  of  the  equivalent 
perturbation  of  roundoff  errors  in  the  solution  of  a  system  of  condi¬ 
tional  equations  by  the  method  of  least  squares  (Method  A)  and  by  a 
method  which  was  proposed  by  D.  K.  FadcLeev,  V.  N.  Faddeeva,  and  V.  N. 
Kublanovskaya  in  a  joint  report  at  a  conference  on  numerical  methods 
in  Kiev  in  1966  (Method  B). 

Let  us  examine  the  system  of  conditional  equations: 

Ax  -  f  (1) 

with  a  rectangular  matrix  A  having  N  rows  and  n  columns,  where 
generally  N  »  n.  Method  A  leads  to  the  system  of  normal  equations 

ATAx  -  ATf  (2) 

r 

with  a  symmetric  positive  definite  matrix  A  A  of  rank  n.  We  will 
assume  that  the  solution  of  (2)  is  found  by  the  method  of  square  roots, 
always  taking  advantage  of  the  accumulation  of  scalar  products,  independ¬ 
ently  of  how  one  computes  the  elements  of  system  (2). 

Method  B  leads  to  a  left  orthogonal  transformation  of  (l)  into 

Px  .  I  (2') 


The  term  "equivalent  perturbation"  seems  to  refer  to  inverse  roundoff 
analysis. 


where  P  =  QA,  i  =  Qf,  matrix  P  has  non-null  elements  only  in  the  right 
upper  triangle  P  of  rank  n.  Let  i  be  the  vector  whose  components  are 
the  first  n  components  of  the  vector  Qf.  The  triangular  system 

Px  =  i  (3) 

is  equivalent  to  system  (2). 

Hie  total  error  in  both  methods  is  composed  of  the  roundoff  error 
in  reading  in  the  coefficients  and  the  right-hand  terms  of  (2)  and  (3) 
and  the  roundoff  error  during  the  solving  of  these  systems.  Since 
triangular  systems  may  be  solved  very  exactly  ([1,  Chapter  k]),  we  can 
neglect  the  roundoff  error  in  the  solution  of  (3)  and  in  the  back- 
solution  part  of  the  method  of  square  roots  in  the  solution  of  (2). 

Because  of  the  equivalence  of  (2)  and  (3)  it  does  not  matter  whether 
one  calculates  the  equivalent  perturbation  of  roundoff  errors  of  Jfethods 
A  and  B  in  terms  of  (2)  or  (3).  We  will  do  the  calculations  in  terms  of 
system  (2)  since  this  is  more  convenient.  Everywhere  below,  if  it  is  not 
specifically  stated,  we  will  use  the  symbols  adopted  in  [1]  and  the 
Euclidean  norm  of  the  matrices  and  vectors. 

l)  Let  us  examine  in  the  first  place  the  errors  of  Method  A. 

Because  of  the  roundoff  in  the  calculation  of  the  scalar  products,  the 

T  T 

elements  of  the  matrix  A  A  and  the  vector  A  f  will  be  obtained  with 
a  certain  error;  i.e.,  instead  of  (2)  we  obtain 

Bx  =  k  (k) 

where  B  =  ATA  +  A(ATA),  k  =  ATf  +  A(ATf). 
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T  T 

The  norms  of  the  error  matrix  A(A  A)  and  the  error  vector  a(A  f) 
essentially  depend  on  the  method  of  calculating  of  scalar  products  in 
the  machine . 

In  the  carrying  out  of  all  operations  in  a  machine  with  a  t-digit 

T  T 

accuracy,  the  elements  of  A(A  A)  and  A(A  f),  which  we  will  designate 
respectively  by  Ab^^  and  Ak^,  may  be  estimated  on  the  basis  of  [1, 
Chapter  3]  as 

lAb^l  <  N2  1  ||ai  ||  Hajll  ,  |  Akj  |  <  N2  ^  HaJI  ||f||  , 

if  the  calculations  are  executed  with  floating  point  (fl).  Here  and 
later  t^  »  t  -  0.03406,  and  a^  is  the  i-th  column  of  the  matrix  A. 
Hence,  we  obtain 


IWatA)||  <  m'N  £  II.J2  ||a  f)1'2 


te\  £  ||a  ||2  £  ||. 

i  J  J 


ll2\1/2  _  MO  1  llfill2 


II A(  ATf ||  <  N2  1  ||A||  Jlfjt. 

If  the  calculations  are  executed  in  fixed  point  (fi),  we  get 
c  orre  spondi ngly 

||a(ATA)||  <  Nn2't'1,  |lA(ATf)||  <  Nn1/2  2"t_1  . 

Here  it  is  assumed  ||a^||  <  l-K2-t_1,  ||f||  <  1-N2*t"1,  which  guarantees 

the  possibility  of  calculating  in  fixed  point. 

If  the  scalar  products  are  calculated  with  double  precision,  then 
the  estimate  under  consideration  is  practically  independent  of  N.  In 
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particular,  in  the  case  of  floating  point  (fjfc^),  according  to  [l. 
Chapter  3], 


I  ^  o-t  /Tv  3  ATO-2t+0.08406  I 

J.  .  <  2  (a.  a . )  +  ^  N2  i  a. 

ij  '  -  v  1  j'  2  "  1' 


3 

Assuming  *  K2  <0.1,  we  obtain 


lAb±J  |  <  2_t  (a*  aj)  +  0.11-2'*  lla.II  ||aj  |  . 

Using  the  relation  |(a?  a.)|  <  ||a  j|  j|a  || ,  we  find 

J  ^  J 

||A(ATA)  <  1.11-2'*  ||A,2  . 


(7) 


In  the  same  way, 

||A(ATf)||  <  1.11‘2~*  ||A||  ||f||  .  (8) 

In  the  case  of  fixed  point  (fig),  we  have 

||A(ATA)||  <  n2't"1,  ||A(ATf)||  <  n1/2  2"*’1  (9) 

with  the  assumption  that  ||a  |j  <  l-2~*  ,  ||f||  <  1-2  *'\ 

Let  us  now  estimate  the  equivalent  perturbation  due  to  the  roundoff 

error  in  the  application  of  the  forward  step  in  the  method  of  square 

roots,  i.e.,  in  the  decomposition  of  the  matrix  of  system  (4)  into  the 

product  of  two  triangular  matrices.  It  is  known  that  the  triang..lar 
T 

factors  S  and  S  of  the  matrix  B  that  are  really  obtained  in  the 
machine  are  the  exact  factors  of  a  certain  matrix  B+E,  that  is 

B  +  E  -  SST.  (10) 
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The  norm  of  C  is  indeed  of  interest  to  us  as  the  norm  of  the  total  error 

T 

in  the  coefficients  of  system  (2),  while  the  norm  of  the  vector  A(A  f) 

is  the  norm  of  the  error  in  the  right-hand  side  of  the  system. 

From  (ll)  in  calculations  with  floating  point,  neglecting  terms 
-2t 

of  order  2  ,  we  have 
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n  p  pf  ^  1“1  O  p  k  ^  p  p 

I  ef  <  2-^  I  (  2  s*  s*  +  4s*  +  2  8*7,  s  ) 

i,j=l  i=l  j=l  1J  j=i+l  J 


<2-2-2t(  is2  s2  +  Z  s2  s2n) 
i,j=l  i,j=l  J 


“•2’2t  *  4,  4 

irj=l 


<  4-2‘2t  max  s2  2  s2  =  4-2_2t||s||4. 

J  ii 0=1  J 


Considering  that 


.Znsij  3  1  8iJ  3  bii  +  eii  3  ,£,aji  +  ^ii  +  eii 

i=l  j=i  0=1 


N 


2  a*  [1  +  0(N2_t)]  >  ||a  f  [1  +  0(N2-1')], 
J-l  J 


s-tx 


we  obtain  ||s||  »  ||A||[l  +  0(N2-t)],  where 


||E||  <  2-2_t  ||A||2(1  +  0(N2't')). 


-t> 


(14 


As  V.  V.  Voyevodin  observed,  these  considerations  permit  us 

to  obtain  an  estimate  of  the  equivalent  perturbation  for  the  method 

of  square  roots  which  is  n  times  better  than  that  suggested  in  [2] 

without  the  assumption  of  accumulation. 

Actually  from  the  above  explanation  it  follows  that  with  an 

_2t 

accuracy  up  to  quantities  of  order  0(2  ) 


He  II  <  2.2_t  [Z  (Z  s2  J2]l/2  <  2*2_t  ||STS|| 
j  i  J 

=  2-2_t  ||SST||  =  2*2_t  |)B | . 

Passing  from  the  Euclidean  norm  to  the  spectral  norm,  we  obtain 

||E|[  <  2 ■2_t(Sp  B)1/2  max  s .  .  <  2*2_t(n  max  X2)1/2 

i  11  -  i  1 

=  2*2't  n1//2  ||B||2  . 

This  estimate  is  n  times  better  than  the  one  obtained  in  [2],  for 
example.  For  fixed  point,  an  estimate  analogous  to  (14),  derived  from 
(12)  with  the  assumption  that  ls^jl  <  has  the  form 

l|E ||  <  n2't‘1(l  +  0(i))  .  (15) 


Using  the  relations  (5)-(l5),  we  obtain  finally 


lie  II 

-tl 

<  N2 

||A||2(1  +  0(i)),  |b(ATf)||  <  N2  t;L 

I|A||  ||f||  ; 

(«) 

lie  h 

<  2-71*2 

_t  ||A||2,||A(ATf)||  <  1.11*2“*  ||A|| 

Ilf  il  J 

(f*2) 

llcli 

<  Nn2_t" 

1(1  +  0(|)),  ||A(ATf)i|  <  Nnl/2  2" 

t-l 

J 

(fi) 

licit 

<  n2’t(l 

+  0(i)),  ||A(ATf)||  <  nl/2  2't'1 

) 

(fi2) 

spectively,  for 

the  calculation  of  the  elements 

T 

of  A  A  and 

ATf 

i  the  cases  of 

tl,  tl2,  fi,  fig. 
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2)  We  will  now  estimate  the  equivalent  perturbation  for  the  errors 
in  Method  B,  which  is  equivalently  an  estimate  of  the  errors  in  the  elements 
of  the  system 

PTPx  =  P Tl  (16) 

which  were  obtained  because  of  the  inaccurate  calculation  of  P  and  £. 

Let  'is  denote  by  AP  and  A£,  respectively,  the  matrix  and  the  vector 

error.  Because  of  these  errors,  instead  of  (l6)  we  obtain  the  perturbed 

system  (P  +  AP)^(P  +  AP)x  =  (P  +  AP)^(£  +  A£).  Neglecting  the  products 
T  T 

AP  AP  and  (AP)  Al,  we  obtain  for  the  perturbations  the  approximate 
equali ties 

A(PTP)  =  PTAP  +  (AP)TP,  A(  PTl )  =  PTA£  +  (AP)T£, 

from  which 

||a(ptp)!I  <2||p||  ||ap||  ,  ||A(pTjt)||  <  ||p||  Ml  +  ||aph  |U|i  . 

Because  of  the  orthogonality  of  the  matrix  of  transformation  Q, 
we  have 

||P||  -  M  =  ||A||  and  ||A||  =  ||Qf||  -  ||f|,  , 

whence 

||A(PTP)||  <2||A||  ||AP||  ,  i|A(PT£)||  <  ||A||  Ml  +  Ii£||  ||AP||  .  (17) 

In  order  to  obtain  final  results  it  is  necessary  to  estimate  the  norms 
of  AP  and  Aj£.  These  estimates  essentially  depend  on  the  actual  method 
of  obtaining  P,  i.e.,  the  method  of  transforming  the  system  of  simultan¬ 
eous  equations  into  system  (2').  To  obtain  the  matrix  P  we  will  eliminate 


8 


the  elements  of  matrix  A  for  which  i  >  j.  We  will  perform  the 
elimination  with  the  help  of  a  matrix  of  rotation  or  reflection  [3]. 
Moreover,  we  will  designate  "by  0^,0^,...  constants,  which  depend  on  the 
actual  method  of  rounding  in  the  machine.  According  to  the  assumptions 
of  [1],  these  constants  are  not  more  than  a  few  units  or  1-2  tens. 

(l)  The  transformation  of  matrix  A  is  accomplished  with  the 
help  of  a  succession  of  elementary  rotation  matrices  T^j  in  a  cyclic 
order  (Method  B^).  Each  of  these  rotations  eliminates  the  element  standing 
in  the  (i,j)-th  position. 

The  roundoff  error  during  the  corresponding  process  of  eliminating 
the  subdiagonal  elements  of  the  square  matrix  was  investigated  in  [1, 
Chapter  3], where  elimination  by  columns  was  examined.  In  our  case  it 
is  more  convenient  and  necessary  to  eliminate  elements  by  rows,  i.e.,  in 

the  order  (2,l),  (3,l),  (3,2) ,  (U,l), . . . ,(n,n-l),  (n+l,l) - ,(n+l,n), . . . , 

(N,n).  It  can  be  shown  that  the  roundoff  error  in  the  elimination  of 
elements  by  rows  and  columns  is  the  same. 


Without  stating  the  calculations,  which  are  like  those  examined  in 
[1,  Chapter  3]>  but  which  are  even  more  cumbersome,  let  us  write  the  final 


result  for  the  i-th  column 


of  the  error  matrix  Ap; 


llAjJi  <  ajL2"t[n(N-n)  +  ^^■]1/2(N+n-2)1'/  (l+6,2't)^N+n'3^||ai  ||  (: 

in  the  case  of  computing  with  floating  point.  In  the  same  way  an  estimate, 
with  the  substitution  of  ||f|j  for  ||a^||,  is  verifiable  for  the  error  of 
transforming  the  column  of  the  right-hand  side.  Here  the  calculation  of 
scalar  products  with  double  precision  has  not  been  assumed.  This  cannot 
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essentially  change  the  estimate  since,  in  the  process  under  consideration, 
we  do  not  encounter  the  calculation  of  scalar  products  of  a  vector  of 
more  than  the  second  order. 

In  computing  with  fixed  point 

IIaJI  <  a,  2_t[n(N-n)  +  ;  (19) 

moreover,  for  it  to  be  possible  to  compute  with  fixed  point  it  is  suf¬ 
ficient  that 

liajl  <l-o^  2_t[n(N-n)  +  . 

The  same  estimate  is  correct  for  the  error  of  rotating  the  right-hand  side. 

The  estimate  obtained  is  exactly  like  that  given  in  [1,  Chapter  3], 

where  actually  the  fact  that  the  transformed  matrix  is  square  is  not  used. 

Considering  that  ||AP||  ■  (  £  ||A.  ||2)1//2,  we  obtain  from  (l8) 

i-1  x 

||AP||  <  Nn1//2  2_t  ||a||  ,  llAil!  <  Nn1/2  2"1  ||f|| 

for  floating  point.  In  the  same  way  from  (19)  we  obtain 

||^>||  <  Nnl/2  2“t||A||,  llAAil  <  Nn2_t 

for  fixed  point. 

(2)  Errors  can  be  reduced  essentially  if  one  uses  rotation  matrices 
with  the  order  of  elimination  of  the  unknowns  that  is  suggested  in  [4] 
(Method  B2). 

Let  us  denote  by  M  the  number  of  cycles  required  for  the  transforma¬ 
tion  of  A  into  triangular  form.  The  estimate  computed  in  [4]  for  our 
case  takes  on  the  form 
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||AP||  <  o^2-tM(l  +  6.2‘t)M‘1|iA||  ,  ||A£||  <  0^(1  +  6-2_t)M"1||f||  (20) 

for  floating  point,  and 

||AP||  <  0^2' V^2  nl/2(n(N-n)  +  , 

(21) 

\\NL\\  <  Q^2-tM1//2[n(N-n)  +  ZLZ^l]1/2 
for  fixed  point. 

For  an  estimate  of  the  value  of  M  let  us  note  that  the  number  of 
cycles  is  independent  of  the  actual  realization  of  the  process  suggested 
in  [4]  if  one  does  not  consider  zero  elements  of  the  initial  matrix  or 
any  elements  accidentally  zeroed  in  one  elementary  transformation.  For 
the  elimination  of  the  m-1  elements  of  the  matrix  consisting  of  m 
rows  and  one  column,  [log2(m-l)j  +  1  cycles  are  required.  Here  the 
square  brackets  denote  the  integer  part. 

Let  the  matrix  have  N  rows  and  n  columns.  For  the  elimination 
of  all  the  elements  of  the  first  column  except  the  first  element,  one 
requires  [^^(N-l)]  +  1  cycles.  With  these  it  could  happen  that  some 
of  the  elements  of  other  columns  are  eliminated.  However,  even  if  one 
disregards  the  last  situation  for  the  elimination  of  elements  of  the 
second  column,  [log^N^)]  +  1  cycles  are  required,  etc. 

Finally,  we  obtain 

n 

M  <  Z  [log^N-k)  +  n  <  nfflog^N-l)]  +  1)  . 

ksl 
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This  estimate  is  a  little  excessive,  but  not  by  more  than  4-5  times  for 
N  <  100000. 

Using  this  estimate  for  M,  we  find  from.  (20)  and  (21 ) 

INI  <  a3  n  loe2  N,2"tHAll  »  Nil  <  “3  n  l0S2  H*2*t||f|| 

for  floating  point  and 

INI  <  c*4  n3,/2(N  loggN)1/2,  ||A£||  <  n(N  logg  N)1^2 

for  fixed  point. 

(3)  Using  a  matrix  of  rotation  (Method  B_)  for  the  elimination  of  the 
•  lemon ts  of  A  appears  most  expedient  in  that  case  where  the  scalar 
products  are  calculated  with  double  precision.  Moreover,  the  estimates  for 
AP  and  A£  are  practically  independent  of  N.  Let  us  assume  here  that 
the  calculation  is  carried  out  in  floating  point.  The  results  obtained  in 
[1,  Chapter  3]  go  for  rectangular  matrices  A  and  give 

INI  <  Q^(n-l)2“'bl|A||,  Ml  <  0^(n-l)2’t||f||. 

Jlaving  substituted  the  estimate  received  for  AP  and  A i  ■  into  (17), 

T 

we  obtain  a  final  estimate  of  the  norm  of  the  error  matrix  A(P  P)  and 

T 

trie  error  vector  A(P  i)',  namely, 
for  method  B^: 

;|A( PTP)11  <  aLNnl/22‘t||A||2,  ||a(PT£)||  <  c^'VVll  ||f||  (f i) 

!|a(PTP)||  <  c^Nn22-t,  ||a( PT£||  <  C^Nn3/22_t  ;  (fi) 
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for  method.  Bgt 


NpTp)ll  <  03  n  loggN-B'^All2,  |KPTi)||  <  a^n  log2N*2"t||A||  ||f||  (fl) 

||a(PTP)||  <  C^n2(N  loggN)1^2,  |(A(PTje)||  <  C^n3/2(N  lo^N)1/2;  (fi) 


for  method  B^: 


||A(PTP)||  <  Q^(n-l)2"^||Ajj2,  |KP^)II  <  a5(n-l)2't||A||  ||f|| 


Comparing  the  obtained  results,  ve  see  that  the  estimates  of  the 
equivalent  perturbations  for  the  matrix  of  system  2  have  the  form 
2"tcp(N,n)|lA||2  and  2“tf(N,n),  respectively,  for  the  different  methods 
of  calculation.  In  the  table  the  order  of  magnitude  of  the  functions 
cp(N,n)  and  $(N,n)  are  set  forth  (N  »  n). 


fi 

•••  . -  • 

Type  of  Computation 

» 

Method 

tl2 

fi  ! 

^2 

A 

I 

N 

l 

const 

■  nN  j 

n 

B1 

j  n1/2N 

l/2„ 
n  '  N 

n2N  j 

n2N 

B2 

j n  loggN 

n  loggN 

{  p  i/2  1 

,  n  (N  loggN) 

n2(N  log^N) 

B3 

i  nN 

n 

!  2 
|  n  N 

n2^/2 

In  this  table  it  is  seen  that  a  comparison  of  Methods  A  and  B,  in 
the  sense  of  majorizing  the  estimate,  goes  as  a  rule  in  favor  of  Method  A. 
Method  B2  is  the  elimination  method. 
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Let  us  go  now  from  the  equivalent  perturbations  to  the  error  in 
the  solution  of  the  system.  It  is  not  difficult  to  construct  an  example 
in  which  with  Method  one  obtains  an  order  of  the  norm  of  the  error  in 
the  solution  which  is  equal  to  the  largest  estimate  of  Method  A  without 
accumulation.  Let  u.i  examine,  for  example,  the  system  with  a  matrix  of 
coefficients  and  a  right-hand  vector,  respectively,  of  the  form 


i  =  J, 

i  /  J,  i  <  n, 
i  >  n. 


i  <  n; 

i  >  n; 


where  e  «  1,  so  that  n(N-n)  e  <  1. 

Let  us  consider  that  computations  are  carried  out  with  fixed  point, 

and  that  the  elementary  matrix  rotations  are  computed  exactly.  Assume 

that  multiplication  by  these  matrices  is  equally  exact.  After  each 

multiplication  by  an  elementary  matrix  of  rotation,  one  rounds  off  the 

elements  obtained  up  to  a  t  digit  number  with  fixed  point,  which  gives 

an  error  of  2  t-  .  It  is  possible  to  assume  that  in  this  situation  the 

elements  of  AP,  wnich  stand  on  the  main  diagonal  and  above,  have  the 

form  (N-n)2  1  +  0(n(N-n)  e  2  ^).  Also,  the  components  of  the  vector 

At  have  this  form  with  numbers  which  are  not  larger  than  n. 

Let  us  designate  by  Ax  the  vector  of  the  error  of  the  solution. 

When  (P  +  AP)(x  +  Ax)  =  l  +  At,  then,  neglecting  the  product  APAx,  we 

obtain  Ax  =  P  ^ At  -  APx).  Having  computed  P  1  and  x,  we  obtain 

j|Ax||  =  0((N-n)^^2  Z).  The  same  order  for  ||Axl1|  wis  obtained.  In-Method 

T  "1  T 

A  if  one  uses  the  identity  Ax  *  (A  A)  (A(A  f)  -  Cx)  and  the  maximum 
estimates  for  A(A  jt)  and  C. 
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